Abstract



Some of the most compelling applications of online convex optimization, including
online prediction and classification, are unconstrained: the natural feasible set
is $\mathbb{R}^n$. Existing algorithms fail to achieve sub-linear regret in this setting unless
constraints on the comparator point $x$ are known in advance. We present algorithms
that, without such prior knowledge, offer near-optimal regret bounds with
respect to any choice of $x$. In particular, regret with respect to $x = 0$ is constant.
We then prove lower bounds showing that our guarantees are near-optimal in this
setting.

1 Introduction 

Over the past several years, online convex optimization has emerged as a fundamental tool for solving
problems in machine learning (see, e.g., [3, 12] for an introduction). The reduction from general
online convex optimization to online linear optimization means that simple and efficient (in memory
and time) algorithms can be used to tackle large-scale machine learning problems. The key theoretical
technique behind essentially all the algorithms in this field is the use of a fixed or increasing
strongly convex regularizer (for gradient descent algorithms, this is equivalent to a fixed or decreasing
learning rate sequence). In this paper, we show that a fundamentally different type of algorithm
can offer significant advantages over these approaches. Our algorithms adjust their learning rates
based not just on the number of rounds, but also on the sum of gradients seen so far. This
allows us to start with small learning rates, but effectively increase the learning rate if the problem
instance warrants it.

This approach produces regret bounds of the form $O(R\sqrt{T}\log((1 + R)T))$, where $R = \|x\|_2$ is the
$L_2$ norm of an arbitrary comparator. Critically, our algorithms provide this guarantee simultaneously
for all $x \in \mathbb{R}^n$, without any need to know $R$ in advance. A consequence of this is that we can
guarantee at most constant regret with respect to the origin, $x = 0$. This technique can be applied to
any online convex optimization problem where a fixed feasible set is not an essential component of
the problem. We discuss two applications of particular interest below:

Online Prediction. Perhaps the single most important application of online convex optimization
is the following prediction setting: the world presents an attribute vector $a_t \in \mathbb{R}^n$; the prediction
algorithm produces a prediction $\sigma(a_t \cdot x_t)$, where $x_t \in \mathbb{R}^n$ represents the model parameters, and
$\sigma : \mathbb{R} \to Y$ maps the linear prediction into the appropriate label space. Then, the adversary reveals
the label $y_t \in Y$, and the prediction is penalized according to a loss function $\ell : Y \times Y \to \mathbb{R}$.
For appropriately chosen $\sigma$ and $\ell$, this becomes a problem of online convex optimization against
functions $f_t(x) = \ell(\sigma(a_t \cdot x), y_t)$. In this formulation, there are no inherent restrictions on the model
coefficients $x \in \mathbb{R}^n$. The practitioner may have prior knowledge that "small" model vectors are more
likely than large ones, but this is rarely best encoded as a feasible set $\mathcal{F}$, which says: "all $x_t \in \mathcal{F}$ are
equally likely, and all other $x_t$ are ruled out." A more general strategy is to introduce a fixed convex
regularizer: $L_1$ and $L_2$ penalties are common, but domain-specific choices are also possible. While
algorithms of this form have proved very effective at solving these problems, theoretical guarantees
usually require fixing a feasible set of radius $R$, or at least an intelligent guess of the norm of an
optimal comparator $x$.

* This work was performed while the author was at Google.

The Unconstrained Experts Problem and Portfolio Management. In the classic problem of
predicting with expert advice (e.g., [3]), there are $n$ experts, and on each round $t$ the player selects
an expert (say $i$), and obtains reward $g_{t,i}$ from a bounded interval (say $[-1, 1]$). Typically, one uses
an algorithm that proposes a probability distribution $p_t$ on experts, so the expected reward is $p_t \cdot g_t$.

Our algorithms apply to an unconstrained version of this problem: there are still $n$ experts with
payouts in $[-1, 1]$, but rather than selecting an individual expert, the player can place a "bet" of
$x_{t,i}$ on each expert $i$, and then receives reward $\sum_i x_{t,i} g_{t,i} = x_t \cdot g_t$. The bets are unconstrained
(betting a negative value corresponds to betting against the expert). In this setting, a natural goal is
the following: place bets so as to achieve as much reward as possible, subject to the constraint that
total losses are bounded by a constant (which can be set equal to some starting budget which is to be
invested). Our algorithms can satisfy constraints of this form because regret with respect to $x = 0$
(which equals total loss) is bounded by a constant.

It is useful to contrast our results in this setting to previous applications of online convex optimization
to portfolio management, for example [6] and [2]. By applying algorithms for exp-concave
loss functions, they obtain log-wealth within $O(\log(T))$ of the best constant rebalanced portfolio.
However, this approach requires a "no-junk-bond" assumption: on each round, for each investment,
you always retain at least an $\alpha > 0$ fraction of your initial investment. While this may be realistic
(though not guaranteed!) for blue-chip stocks, it certainly is not for bets on derivatives that can
lose all their value unless a particular event occurs (e.g., a stock price crosses some threshold). Our
model allows us to handle such investments: if we play $x_{t,i} > 0$, an outcome of $g_{t,i} = -1$ corresponds
exactly to losing 100% of that investment. Our results imply that if even one investment (out of
exponentially many choices) has significant returns, we will increase our wealth exponentially.¹

Notation and Problem Statement. For the algorithms considered in this paper, it will be more
natural to consider reward-maximization rather than loss-minimization. Therefore, we consider
online linear optimization where the goal is to maximize cumulative reward given adversarially
selected linear reward functions $f_t(x) = g_t \cdot x$. On each round $t = 1 \ldots T$, the algorithm selects a
point $x_t \in \mathbb{R}^n$, receives reward $f_t(x_t) = g_t \cdot x_t$, and observes $g_t$. For simplicity, we assume $g_{t,i} \in [-1, 1]$,
that is, $\|g_t\|_\infty \le 1$. If the real problem is against convex loss functions $\ell_t(x)$, they can be
converted to our framework by taking $g_t = -\nabla \ell_t(x_t)$ (see pseudo-code for Reward-Doubling),
using the standard reduction from online convex optimization to online linear optimization [13].

We use the compressed summation notation $g_{1:t} = \sum_{s=1}^{t} g_s$ for both vectors and scalars. We study
the reward of our algorithms, and their regret against a fixed comparator $x$:

\[ \text{Reward} = \sum_{t=1}^{T} g_t \cdot x_t \quad \text{and} \quad \text{Regret}(x) = g_{1:T} \cdot x - \sum_{t=1}^{T} g_t \cdot x_t. \]
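The definitions of Reward and Regret above can be sketched in a few lines of Python. This is only an illustration: the function names (`dot`, `reward`, `regret`) and the list-of-lists representation are assumptions made here, not part of the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def reward(gradients: Sequence[Vector], plays: Sequence[Vector]) -> float:
    """Cumulative reward: sum over rounds of g_t . x_t."""
    return sum(dot(g, x) for g, x in zip(gradients, plays))


def regret(gradients: Sequence[Vector], plays: Sequence[Vector],
           comparator: Vector) -> float:
    """Regret(x) = g_{1:T} . x - sum_t g_t . x_t for a fixed comparator x."""
    n = len(comparator)
    g_sum = [sum(g[i] for g in gradients) for i in range(n)]  # g_{1:T}
    return dot(g_sum, comparator) - reward(gradients, plays)
```

Note that regret against the origin is simply the negated reward, which is why a constant bound on origin-regret is the same as a constant bound on total loss.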

Comparison of Regret Bounds. The primary contribution of this paper is to establish matching
upper and lower bounds for unconstrained online convex optimization problems, using algorithms
that require no prior information about the comparator point $x$. Specifically, we present an algorithm
that, for any $x \in \mathbb{R}^n$, guarantees $\text{Regret}(x) \le O\big(\|x\|_2 \sqrt{T} \log((1 + \|x\|_2)\sqrt{T})\big)$. To obtain
this guarantee, we show that it is sufficient (and necessary) that reward is $\Omega(\exp(|g_{1:T}|/\sqrt{T}))$ (see
Theorem 1). This shift of emphasis from regret-minimization to reward-maximization eliminates
the quantification on $x$, and may be useful in other contexts.

Table 1 compares the bounds for Reward-Doubling (this paper) to those of two previous algorithms:
online gradient descent [13] and projected exponentiated gradient descent [8, 12]. For each
algorithm, we consider a fixed choice of parameter settings and then look at how regret changes as
we vary the comparator point $x$.

Assuming $\|g_t\|_2 \le 1$:

                                         $x = 0$        $\|x\|_2 \le R$                               Arbitrary $x$
  Gradient Descent, $\eta = R/\sqrt{T}$  $R\sqrt{T}$    $R\sqrt{T}$                                   $\|x\|_2 T$
  Reward-Doubling                        $\epsilon$     $R\sqrt{T}\log\frac{n(1+R)T}{\epsilon}$       $\|x\|_2\sqrt{T}\log\frac{n(1+\|x\|_2)T}{\epsilon}$

Assuming $\|g_t\|_\infty \le 1$:

                                         $x = 0$              $\|x\|_1 \le R$                         Arbitrary $x$
  Exponentiated G.D.                     $R\sqrt{T\log n}$    $R\sqrt{T\log n}$                       $\|x\|_1 T$
  Reward-Doubling                        $\epsilon$           $R\sqrt{T}\log\frac{n(1+R)T}{\epsilon}$ $\|x\|_1\sqrt{T}\log\frac{n(1+\|x\|_1)T}{\epsilon}$

Table 1: Worst-case regret bounds for various algorithms (up to constant factors). Exponentiated
G.D. uses feasible set $\{x : \|x\|_1 \le R\}$, and Reward-Doubling uses $\epsilon_i = \frac{\epsilon}{n}$ in both cases.

¹ Our bounds are not directly comparable to the bounds cited above: an $O(\log(T))$ regret bound on
log-wealth implies wealth at least $O(\mathrm{OPT}/T)$, whereas we guarantee wealth like $O(\mathrm{OPT} - \sqrt{T})$. But more
importantly, the comparison classes are different.

Gradient descent is minimax-optimal [1] when the comparator point is contained in a hypersphere
whose radius is known in advance ($\|x\|_2 \le R$) and gradients are sparse ($\|g_t\|_2 \le 1$, top table).
Exponentiated gradient descent excels when gradients are dense ($\|g_t\|_\infty \le 1$, bottom table) but the
comparator point is sparse ($\|x\|_1 \le R$ for $R$ known in advance). In both these cases, the bounds for
Reward-Doubling match those of the previous algorithms up to logarithmic factors, even when
they are tuned optimally with knowledge of $R$.

The advantage of Reward-Doubling shows up when the guess of $R$ used to tune the competing
algorithms turns out to be wrong. When $x = 0$, Reward-Doubling offers constant regret
compared to $\Omega(\sqrt{T})$ for the other algorithms. When $x$ can be arbitrary, only Reward-Doubling
offers sub-linear regret (and in fact its regret bound is optimal, as shown in Theorem 8).

In order to guarantee constant origin-regret, Reward-Doubling frequently "jumps" back to 
playing the origin, which may be undesirable in some applications. In Section 4 we introduce
Smooth-Reward-Doubling, which achieves similar guarantees without resetting to the origin. 

Related Work. Our work is related, at least in spirit, to the use of a momentum term in stochastic
gradient descent for back propagation in neural networks OUT] [9). These results are similar in 
motivation in that they effectively yield a larger learning rate when many recent gradients point in 
the same direction. 

In Follow-The-Regularized-Leader terms, the exponentiated gradient descent algorithm with unnormalized
weights of Kivinen and Warmuth [8] plays

\[ x_{t+1} = \arg\min_{x \in \mathbb{R}^n_+} \; g_{1:t} \cdot x + \frac{1}{\eta}(x \log x - x), \]

which has closed-form solution $x_{t+1} = \exp(-\eta g_{1:t})$. Like our algorithm, this algorithm moves
away from the origin exponentially fast, but unlike our algorithm it can incur arbitrarily large regret
with respect to $x = 0$. Theorem 9 shows that no algorithm of this form can provide bounds like the
ones proved in this paper.

Hazan and Kale [5] give regret bounds in terms of the variance of the $g_t$. Letting $G = |g_{1:T}|$ and
$H = \sum_{t=1}^{T} g_t^2$, they prove regret bounds of the form $O(\sqrt{V})$ where $V = H - G^2/T$. This result
has some similarity to our work in that $G/\sqrt{T} = \sqrt{H - V}$, and so if we hold $H$ constant, then
when $V$ is low, the critical ratio $G/\sqrt{T}$ that appears in our bounds is large. However, they consider
the case of a known feasible set, and their algorithm (gradient descent with a constant learning rate)
cannot obtain bounds of the form we prove.

2 Reward and Regret 

In this section we present a general result that converts lower bounds on reward into upper bounds 
on regret, for one-dimensional online linear optimization. In the unconstrained setting, this result 
will be sufficient to provide guarantees for general n-dimensional online convex optimization. 
Theorem 1. Consider an algorithm for one-dimensional online linear optimization that, when run
on a sequence of gradients $g_1, g_2, \ldots, g_T$, with $g_t \in [-1, 1]$ for all $t$, guarantees

\[ \text{Reward} \ge \kappa \exp(\gamma |g_{1:T}|) - \epsilon, \tag{1} \]

where $\gamma, \kappa > 0$ and $\epsilon \ge 0$ are constants. Then, against any comparator $x \in [-R, R]$, we have

\[ \text{Regret}(x) \le \frac{R}{\gamma}\left(\log\left(\frac{R}{\kappa\gamma}\right) - 1\right) + \epsilon, \tag{2} \]

letting $0 \log 0 = 0$ when $R = 0$. Further, any algorithm with the regret guarantee of Eq. (2) must
guarantee the reward of Eq. (1).

We give a proof of this theorem in the appendix. The duality between reward and regret can also be
seen as a consequence of the fact that $\exp(x)$ and $y \log y - y$ are convex conjugates. The $\gamma$ term
typically contains a dependence on $T$ like $1/\sqrt{T}$. This bound holds for all $R$, and so for some small
$R$ the log term becomes negative; however, for real algorithms the $\epsilon$ term will ensure the regret
bound remains positive. The minus one can of course be dropped to simplify the bound further.
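The conversion in Theorem 1 can be checked numerically. The sketch below assumes the worst case for the algorithm is reward exactly $\kappa \exp(\gamma G) - \epsilon$ with $G = |g_{1:T}|$, so that regret against $x = R$ is $RG - \kappa\exp(\gamma G) + \epsilon$, and maximizes this over a grid of $G$ values. The function names and the grid search are illustrative choices made here, not part of the paper.

```python
import math


def regret_bound(R: float, kappa: float, gamma: float, eps: float) -> float:
    """Right-hand side of Eq. (2), with the convention 0 log 0 = 0."""
    if R == 0:
        return eps
    return (R / gamma) * (math.log(R / (kappa * gamma)) - 1.0) + eps


def worst_case_regret(R: float, kappa: float, gamma: float, eps: float,
                      g_max: float, steps: int = 20000) -> float:
    """Grid-search max over G in [0, g_max] of R*G - kappa*exp(gamma*G) + eps."""
    best = -math.inf
    for j in range(steps + 1):
        G = g_max * j / steps
        best = max(best, R * G - kappa * math.exp(gamma * G) + eps)
    return best
```

When the maximizing value $G^* = \frac{1}{\gamma}\log\frac{R}{\kappa\gamma}$ lies inside the grid, the two quantities agree up to discretization error, which illustrates that the regret bound is tight with respect to the reward bound.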



3 Gradient Descent with Increasing Learning Rates 

In this section we show that allowing the learning rate of gradient descent to sometimes increase 
leads to novel theoretical guarantees. 

To build intuition, consider online linear optimization in one dimension, with gradients
$g_1, g_2, \ldots, g_T$, all in $[-1, 1]$. In this setting, the reward of unconstrained gradient descent has a
simple closed form:

Lemma 2. Consider unconstrained gradient descent in one dimension, with learning rate $\eta$. On
round $t$, this algorithm plays the point $x_t = \eta g_{1:t-1}$. Letting $G = |g_{1:T}|$ and $H = \sum_{t=1}^{T} g_t^2$, the
cumulative reward of the algorithm is exactly

\[ \text{Reward} = \frac{\eta}{2}\left(G^2 - H\right). \]

We give a simple direct proof in Appendix A. Perhaps surprisingly, this result implies that the reward
is totally independent of the order of the linear functions selected by the adversary. Examining the
expression in Lemma 2, we see that the optimal choice of learning rate $\eta$ depends fundamentally on
two quantities: the absolute value of the sum of gradients ($G$), and the sum of the squared gradients
($H$). If $G^2 > H$, we would like to use as large a learning rate as possible in order to maximize
reward. In contrast, if $G^2 < H$, the algorithm will obtain negative reward, and the best it can do is
to cut its losses by setting $\eta$ as small as possible.
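The closed form of Lemma 2, and the order-independence it implies, can be checked directly. The following is a minimal sketch with hypothetical function names: it simulates one-dimensional gradient descent and compares the simulated reward with $\frac{\eta}{2}(G^2 - H)$.

```python
from typing import Sequence


def gradient_descent_reward(gradients: Sequence[float], eta: float) -> float:
    """Simulate x_t = eta * g_{1:t-1} and accumulate the reward x_t * g_t."""
    x = 0.0
    total = 0.0
    for g in gradients:
        total += x * g   # reward earned on this round
        x += eta * g     # gradient step (reward maximization)
    return total


def closed_form_reward(gradients: Sequence[float], eta: float) -> float:
    """Lemma 2: Reward = (eta / 2) * (G^2 - H)."""
    G = abs(sum(gradients))
    H = sum(g * g for g in gradients)
    return 0.5 * eta * (G * G - H)
```

Shuffling the gradient sequence leaves both quantities unchanged up to floating-point error, since $G$ and $H$ do not depend on the order.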

One of the motivations for this work is the observation that the state-of-the-art online gradient descent
algorithms adjust their learning rates based only on the observed value of $H$ (or its upper bound
$T$); for example [4, 10]. We would like to increase reward by also accounting for $G$. But unlike $H$,
which is monotonically increasing with time, $G$ can both increase and decrease. This makes simple
guess-and-doubling tricks fail when applied to $G$, and necessitates a more careful approach.



3.1 Analysis in One Dimension 

In this section we analyze algorithm Reward-Doubling-1D (Algorithm 1), which consists of a
series of epochs. We suppose for the moment that an upper bound $\bar{H}$ on $H = \sum_{t=1}^{T} g_t^2$ is known
in advance. In the first epoch, we run gradient descent with a small initial learning rate $\eta = \eta_1$.
Whenever the total reward accumulated in the current epoch reaches $\eta \bar{H}$, we double $\eta$ and start a
new epoch (returning to the origin and forgetting all previous gradients except the most recent one).
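The epoch scheme just described can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation; the function name, the return values, and the variable names (`eta1`, `h_bar` for $\eta_1$ and $\bar{H}$) are assumptions made here.

```python
from typing import List, Sequence, Tuple


def reward_doubling_1d(gradients: Sequence[float], eta1: float,
                       h_bar: float) -> Tuple[float, List[float], int]:
    """Run the epoch-based scheme on a 1-d gradient sequence.

    Returns (total reward, list of points played, number of epochs used).
    h_bar is assumed to upper-bound the sum of squared gradients.
    """
    x = 0.0          # current point; x_1 = 0
    eta = eta1       # learning rate of the current epoch
    q = 0.0          # reward accumulated in the current epoch
    total = 0.0
    epochs = 1
    plays: List[float] = []
    for g in gradients:
        plays.append(x)
        total += x * g
        q += x * g
        if q < eta * h_bar:
            x = x + eta * g      # ordinary gradient step
        else:
            epochs += 1          # epoch target reached: double the rate,
            eta *= 2.0           # reset the epoch reward, and restart from
            q = 0.0              # the origin using only the latest gradient
            x = 0.0 + eta * g
    return total, plays, epochs
```

On a sequence of identical gradients the learning rate doubles repeatedly and the reward grows quickly, while on an alternating sequence the algorithm stays in its first epoch and loses at most a small multiple of $\eta_1 \bar{H}$.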

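Lemma 3. Applied to a sequence of gradients $g_1, g_2, \ldots, g_T$, all in $[-1, 1]$, where $H = \sum_{t=1}^{T} g_t^2 \le \bar{H}$,
Reward-Doubling-1D obtains reward satisfying

\[ \text{Reward} = \sum_{t=1}^{T} x_t g_t \ge \frac{1}{4}\eta_1 \bar{H} \exp\left(a \frac{|g_{1:T}|}{\sqrt{\bar{H}}}\right) - \eta_1 \bar{H}, \tag{3} \]

for $a = \log(2)/\sqrt{3}$.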
Algorithm 1 Reward-Doubling-1D

  Parameters: initial learning rate $\eta_1$, upper bound $\bar{H} \ge \sum_{t=1}^{T} g_t^2$.
  Initialize $x_1 \leftarrow 0$, $i \leftarrow 1$, and $Q_1 \leftarrow 0$.
  for $t = 1, 2, \ldots, T$ do
    Play $x_t$, and receive reward $x_t g_t$.
    $Q_i \leftarrow Q_i + x_t g_t$.
    if $Q_i < \eta_i \bar{H}$ then
      $x_{t+1} \leftarrow x_t + \eta_i g_t$.
    else
      $i \leftarrow i + 1$.
      $\eta_i \leftarrow 2\eta_{i-1}$; $Q_i \leftarrow 0$.
      $x_{t+1} \leftarrow 0 + \eta_i g_t$.

Algorithm 2 Reward-Doubling

  Parameters: maximum origin-regret $\epsilon_i$ for $1 \le i \le n$.
  for $i = 1, 2, \ldots, n$ do
    Let $A_i$ be a copy of algorithm Reward-Doubling-1D-Guess
    (see Theorem 4), with parameter $\epsilon_i$.
  for $t = 1, 2, \ldots, T$ do
    Play $x_t$, with $x_{t,i}$ selected by $A_i$.
    Receive gradient vector $g_t = -\nabla f_t(x_t)$.
    for $i = 1, 2, \ldots, n$ do
      Feed back $g_{t,i}$ to $A_i$.


Proof. Suppose round $T$ occurs during the $k$'th epoch. Because epoch $i$ can only come to an end if
$Q_i \ge \eta_i \bar{H}$, where $\eta_i = 2^{i-1}\eta_1$, we have

\[ \text{Reward} = \sum_{i=1}^{k} Q_i \ge \left(\sum_{i=1}^{k-1} 2^{i-1}\eta_1 \bar{H}\right) + Q_k = \left(2^{k-1} - 1\right)\eta_1 \bar{H} + Q_k. \tag{4} \]

We now lower bound $Q_k$. For $i = 1, \ldots, k$ let $t_i$ denote the round on which $Q_i$ is initialized to 0,
with $t_1 = 1$, and define $t_{k+1} = T$. By construction, $Q_i$ is the total reward of a gradient descent
algorithm that is active on rounds $t_i$ through $t_{i+1}$ inclusive, and that uses learning rate $\eta_i$ (note that
on round $t_i$, this algorithm gets 0 reward and we initialize $Q_i$ to 0 on that round). Thus, by Lemma 2
we have that for any $i$,

\[ Q_i = \frac{\eta_i}{2}\left((g_{t_i:t_{i+1}})^2 - \sum_{s=t_i}^{t_{i+1}} g_s^2\right) \ge -\frac{\eta_i}{2}\bar{H}. \]

Applying this bound to epoch $k$, we have $Q_k \ge -\frac{1}{2}\eta_k \bar{H} = -2^{k-2}\eta_1 \bar{H}$. Substituting into (4) gives

\[ \text{Reward} \ge \eta_1 \bar{H}\left(2^{k-1} - 1 - 2^{k-2}\right) = \eta_1 \bar{H}\left(2^{k-2} - 1\right). \tag{5} \]

We now show that $k \ge \frac{|g_{1:T}|}{\sqrt{3\bar{H}}}$. At the end of round $t_{i+1} - 1$, we must have had $Q_i < \eta_i \bar{H}$ (otherwise
epoch $i + 1$ would have begun earlier). Thus, again using Lemma 2,

\[ \frac{\eta_i}{2}\left((g_{t_i:t_{i+1}-1})^2 - \bar{H}\right) \le \eta_i \bar{H} \]

so $|g_{t_i:t_{i+1}-1}| \le \sqrt{3\bar{H}}$. Thus,

\[ |g_{1:T}| \le \sum_{i=1}^{k} |g_{t_i:t_{i+1}-1}| \le k\sqrt{3\bar{H}}. \]

Rearranging gives $k \ge \frac{|g_{1:T}|}{\sqrt{3\bar{H}}}$, and combining with Eq. (5) proves the lemma. □

We can now apply Theorem 1 to the reward (given by Eq. (3)) of Reward-Doubling-1D to show

\[ \text{Regret}(x) \le bR\sqrt{\bar{H}}\left(\log\left(\frac{4Rb}{\eta_1\sqrt{\bar{H}}}\right) - 1\right) + \eta_1 \bar{H} \tag{6} \]

for any $x \in [-R, R]$, where $b = a^{-1} = \sqrt{3}/\log(2) < 2.5$. When the feasible set is also fixed in
advance, online gradient descent with a fixed learning rate obtains a regret bound of $O(R\sqrt{T})$. Suppose
we use the estimate $\bar{H} = T$. By choosing $\eta_1 = \frac{1}{T}$, we guarantee constant regret against the origin,
$x = 0$ (equivalently, constant total loss). Further, for any feasible set of radius $R$, we still have
worst-case regret of at most $O(R\sqrt{T}\log((1 + R)T))$, which is only modestly worse than that of
gradient descent with the optimal $R$ known in advance.

The need for an upper bound $\bar{H}$ can be removed using a standard guess-and-doubling approach, at
the cost of a constant factor increase in regret (see appendix for proof). 

Theorem 4. Consider algorithm Reward-Doubling-1D-Guess, which behaves as follows. On
each era $i$, the algorithm runs Reward-Doubling-1D with an upper bound of $\bar{H}_i = 2^{i-1}$, and
initial learning rate $\eta_1^i = \epsilon 2^{-2i}$. An era ends when $\bar{H}_i$ is no longer an upper bound on the sum of
squared gradients seen during that era. Letting $c = \frac{\sqrt{2}}{\sqrt{2}-1} b$, this algorithm has regret at most

\[ \text{Regret} \le cR\sqrt{H + 1}\left(\log\left(\frac{R}{\epsilon}(2H + 2)^{5/2}\right) - 1\right) + \epsilon. \]

3.2 Extension to n dimensions 



To extend our results to general online convex optimization, it is sufficient to run a separate copy of
Reward-Doubling-1D-Guess for each coordinate, as is done in Reward-Doubling (Algorithm 2).
The key to the analysis of this algorithm is that overall regret is simply the sum of regret
on $n$ one-dimensional subproblems which can be analyzed independently.
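The per-coordinate reduction can be illustrated with a small sketch. To keep it short, the one-dimensional learner used here is plain gradient descent (a stand-in, not Reward-Doubling-1D-Guess); the class and function names are hypothetical, and the gradient sequence is taken as fixed in advance rather than computed from loss functions at the points played.

```python
from typing import List, Sequence


class GradientDescent1D:
    """Stand-in 1-d learner (an assumption for illustration only)."""

    def __init__(self, eta: float) -> None:
        self.eta = eta
        self.x = 0.0

    def play(self) -> float:
        return self.x

    def update(self, g: float) -> None:
        self.x += self.eta * g


def run_per_coordinate(learners: Sequence[GradientDescent1D],
                       gradients: Sequence[Sequence[float]]) -> List[List[float]]:
    """Coordinate i of each play is chosen by learner i, which only sees g_{t,i}."""
    plays: List[List[float]] = []
    for g in gradients:
        plays.append([a.play() for a in learners])
        for a, gi in zip(learners, g):
            a.update(gi)
    return plays


def coordinate_regrets(gradients: Sequence[Sequence[float]],
                       plays: Sequence[Sequence[float]],
                       comparator: Sequence[float]) -> List[float]:
    """Regret_i = sum_t x_i g_{t,i} - sum_t x_{t,i} g_{t,i} for each coordinate i."""
    return [sum(comparator[i] * g[i] - x[i] * g[i] for g, x in zip(gradients, plays))
            for i in range(len(comparator))]
```

Summing the entries of `coordinate_regrets` recovers the overall regret against the comparator, which is the decomposition the analysis relies on.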

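Theorem 5. Given a sequence of convex loss functions $f_1, f_2, \ldots, f_T$ from $\mathbb{R}^n$ to $\mathbb{R}$,
Reward-Doubling with $\epsilon_i = \frac{\epsilon}{n}$ has regret bounded by

\begin{align*}
\text{Regret}(x) &\le \epsilon + c\sum_{i=1}^{n} |x_i|\sqrt{H_i + 1}\left(\log\left(\frac{n}{\epsilon}|x_i|(2H_i + 2)^{5/2}\right) - 1\right) \\
&\le \epsilon + c\|x\|_2\sqrt{H + n}\left(\log\left(\frac{n}{\epsilon}\|x\|_2^2(2H + 2)^{5/2}\right) - 1\right)
\end{align*}

for $c = \frac{\sqrt{2}}{\sqrt{2}-1} b$, where $H_i = \sum_{t=1}^{T} g_{t,i}^2$ and $H = \sum_{t=1}^{T} \|g_t\|_2^2$.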
Proof. Fix a comparator $x$. For any coordinate $i$, define

\[ \text{Regret}_i = \sum_{t=1}^{T} x_i g_{t,i} - \sum_{t=1}^{T} x_{t,i} g_{t,i}. \]

Observe that

\[ \sum_{i=1}^{n} \text{Regret}_i = \sum_{t=1}^{T} x \cdot g_t - \sum_{t=1}^{T} x_t \cdot g_t = \text{Regret}(x). \]

Furthermore, $\text{Regret}_i$ is simply the regret of Reward-Doubling-1D-Guess on the gradient sequence
$g_{1,i}, g_{2,i}, \ldots, g_{T,i}$. Applying the bound of Theorem 4 to each $\text{Regret}_i$ term completes the
proof of the first inequality. For the second inequality, let $\vec{H}$ be a vector whose $i$th component is
$\sqrt{H_i + 1}$, and let $\vec{x} \in \mathbb{R}^n$ where $\vec{x}_i = |x_i|$. Using the Cauchy-Schwarz inequality, we have

\[ \sum_{i=1}^{n} |x_i|\sqrt{H_i + 1} = \vec{x} \cdot \vec{H} \le \|\vec{x}\|_2 \|\vec{H}\|_2 = \|x\|_2\sqrt{H + n}. \]

This, together with the fact that $\log(|x_i|(2H_i + 2)^{5/2}) \le \log(\|x\|_2^2(2H + 2)^{5/2})$, suffices to prove the
second inequality. □

In some applications, $n$ is not known in advance. In this case, we can set $\epsilon_i = \frac{\epsilon}{i^2}$ for the $i$th
coordinate we encounter, and get the same bound up to constant factors.

4 An Epoch-Free Algorithm 

In this section we analyze Smooth-Reward-Doubling, a simple algorithm that achieves bounds
comparable to those of Theorem 4 without guessing-and-doubling. We consider only the 1-d problem,
as the technique of Theorem 5 can be applied to extend to $n$ dimensions. Given a parameter
$\eta > 0$, we achieve

\[ \text{Regret} \le R\sqrt{T + 5}\left(\log\left(\frac{R(T + 5)^{3/2}}{\eta}\right) - 1\right) + 1.76\eta, \tag{7} \]

for all $T$ and $R$, which is better (by constant factors) than Theorem 4 when $g_t \in \{-1, 1\}$ (which
implies $T = H$). The bound can be worse on problems where $H < T$.

The idea of the algorithm is to maintain the invariant that our cumulative reward, as a function of
$g_{1:t}$ and $t$, satisfies $\text{Reward} \ge N(g_{1:t}, t)$, for some fixed function $N$. Because reward changes by
$g_t x_t$ on round $t$, it suffices to guarantee that for any $g \in [-1, 1]$,

\[ N(g_{1:t}, t) + g x_{t+1} \ge N(g_{1:t} + g, t + 1) \tag{8} \]

where $x_{t+1}$ is the point the algorithm plays on round $t + 1$, and we assume $N(0, 1) = 0$.

This inequality is approximately satisfied (for small $g$) if we choose

\[ x_{t+1} = \frac{\partial N(g_{1:t} + g, t)}{\partial g} \approx \frac{N(g_{1:t} + g, t) - N(g_{1:t}, t)}{g} \approx \frac{N(g_{1:t} + g, t + 1) - N(g_{1:t}, t)}{g}. \]

This suggests that if we want to maintain reward at least $N(g_{1:t}, t) = \frac{1}{t}\left(\exp(|g_{1:t}|/\sqrt{t}) - 1\right)$, we
should set $x_{t+1} \approx \mathrm{sign}(g_{1:t})\, t^{-3/2} \exp\left(\frac{|g_{1:t}|}{\sqrt{t}}\right)$. The following theorem (proved in the appendix)
provides an inductive analysis of an algorithm of this form.

Theorem 6. Fix a sequence of reward functions $f_t(x) = g_t x$ with $g_t \in [-1, 1]$, and let $G_t = |g_{1:t}|$.
We consider Smooth-Reward-Doubling, which plays 0 on round 1 and whenever $G_t = 0$;
otherwise, it plays

\[ x_{t+1} = \eta\, \mathrm{sign}(g_{1:t})\, B(G_t, t + 5) \tag{9} \]

with $\eta > 0$ a learning-rate parameter and

\[ B(G, t) = \frac{1}{t^{3/2}} \exp\left(\frac{G}{\sqrt{t}}\right). \tag{10} \]

Then, at the end of each round $t$, this algorithm has

\[ \text{Reward}(t) \ge \eta \frac{1}{t + 5} \exp\left(\frac{G_t}{\sqrt{t + 5}}\right) - 1.76\eta. \]

Two main technical challenges arise in the proof: first, we prove a result like Eq. (8) for $N(g_{1:t}, t) =
(1/t)\exp(|g_{1:t}|/\sqrt{t})$. However, this lemma only holds for $t \ge 6$ and when the sign of $g_{1:t}$ doesn't
change. We account for this by showing that a small modification to $N$ (costing only a constant over
all rounds) suffices.

By running this algorithm independently for each coordinate using an appropriate choice of $\eta$, one
can obtain a guarantee similar to that of Theorem 5.
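A minimal sketch of the one-dimensional Smooth-Reward-Doubling update, following Eqs. (9) and (10), is given below. The function names and the choice to return the per-round history are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def b_func(G: float, t: float) -> float:
    """B(G, t) = t^(-3/2) * exp(G / sqrt(t)), as in Eq. (10)."""
    return t ** -1.5 * math.exp(G / math.sqrt(t))


def smooth_reward_doubling(gradients: Sequence[float],
                           eta: float = 1.0) -> List[Tuple[float, float]]:
    """Return [(Reward(t), G_t)] after each round t = 1, 2, ..."""
    g_sum = 0.0   # g_{1:t}
    x = 0.0       # the algorithm plays 0 on round 1
    total = 0.0
    history: List[Tuple[float, float]] = []
    for t, g in enumerate(gradients, start=1):
        total += x * g
        g_sum += g
        G = abs(g_sum)
        if G == 0.0:
            x = 0.0   # play 0 whenever the gradient sum is zero
        else:
            x = eta * math.copysign(1.0, g_sum) * b_func(G, t + 5)
        history.append((total, G))
    return history


def reward_lower_bound(G: float, t: int, eta: float = 1.0) -> float:
    """The guarantee stated in Theorem 6 for the end of round t."""
    return eta * math.exp(G / math.sqrt(t + 5)) / (t + 5) - 1.76 * eta
```

Unlike the epoch-based scheme, the point played here changes smoothly with $g_{1:t}$ and $t$, and there is no reset to the origin except when the gradient sum is exactly zero.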

5 Lower Bounds 

As with our previous results, it is sufficient to show a lower bound in one dimension, as it can then
be replicated independently in each coordinate to obtain an $n$ dimensional bound. Note that our
lower bound contains the factor $\log(|x|\sqrt{T})$, which can be negative when $x$ is small relative to $T$,
hence it is important to hold $x$ fixed and consider the behavior as $T \to \infty$. Here we give only a
proof sketch; see Appendix A for the full proof.

Theorem 7. Consider the problem of unconstrained online linear optimization in one dimension,
and an online algorithm that guarantees origin-regret at most $\epsilon$. Then, for any fixed comparator $x$,
and any integer $T_0$, there exists a gradient sequence $\{g_t\} \in [-1, 1]^T$ of length $T \ge T_0$ for which
the algorithm's regret satisfies

\[ \text{Regret}(x) \ge 0.336|x|\sqrt{T \log\left(\frac{|x|\sqrt{T}}{\epsilon}\right)}. \]
Proof. (Sketch) Assume without loss of generality that $x > 0$. Let $Q$ be the algorithm's reward
when each $g_t$ is drawn independently uniformly from $\{-1, 1\}$. We have $E[Q] = 0$, and because the
algorithm guarantees origin-regret at most $\epsilon$, we have $Q \ge -\epsilon$ with probability 1. Letting $G = g_{1:T}$,
it follows that for any threshold $Z = Z(T)$,

\begin{align*}
0 &= E[Q] \\
&= E[Q \mid G < Z] \cdot \Pr[G < Z] + E[Q \mid G \ge Z] \cdot \Pr[G \ge Z] \\
&\ge -\epsilon \Pr[G < Z] + E[Q \mid G \ge Z] \cdot \Pr[G \ge Z] \\
&> -\epsilon + E[Q \mid G \ge Z] \cdot \Pr[G \ge Z].
\end{align*}

Equivalently,

\[ E[Q \mid G \ge Z] < \frac{\epsilon}{\Pr[G \ge Z]}. \]

We choose $Z(T) = \sqrt{kT}$, where $k = \left\lfloor \log\left(\frac{R\sqrt{T}}{\epsilon}\right) / \log(p^{-1}) \right\rfloor$. Here $R = |x|$ and $p > 0$ is a
constant chosen using binomial distribution lower bounds so that $\Pr[G \ge Z] \ge p^k$. This implies

\[ E[Q \mid G \ge Z] < \epsilon p^{-k} = \epsilon \exp(k \log p^{-1}) \le R\sqrt{T}. \]

This implies there exists a sequence with $G \ge Z$ and $Q < R\sqrt{T}$. On this sequence, regret is at least

\[ Gx - Q \ge R\sqrt{kT} - R\sqrt{T} = \Omega(R\sqrt{kT}). \quad \square \]

Theorem 8. Consider the problem of unconstrained online linear optimization in $\mathbb{R}^n$, and consider
an online algorithm that guarantees origin-regret at most $\epsilon$. For any radius $R$, and any $T_0$, there exists
a gradient sequence $\{g_t\} \in ([-1, 1]^n)^T$ of length $T \ge T_0$, and a comparator
$x$ with $\|x\|_1 = R$, for which the algorithm's regret satisfies

\[ \text{Regret}(x) \ge 0.336 \sum_{i=1}^{n} |x_i| \sqrt{T \log\left(\frac{|x_i|\sqrt{T}}{\epsilon}\right)}. \]

Proof. For each coordinate $i$, Theorem 7 implies that there exists a $T \ge T_0$ and a sequence of
gradients $g_{t,i}$ such that

\[ \sum_{t=1}^{T} x_i g_{t,i} - \sum_{t=1}^{T} x_{t,i} g_{t,i} \ge 0.336|x_i| \sqrt{T \log\left(\frac{|x_i|\sqrt{T}}{\epsilon}\right)}. \]

(The proof of Theorem 7 makes it clear that we can use the same $T$ for all $i$.) Summing this
inequality across all $n$ coordinates then gives the regret bound stated in the theorem. □

The following theorem presents a stronger negative result for Follow-the-Regularized-Leader algorithms
with a fixed regularizer: for any such algorithm that guarantees origin-regret at most $\epsilon_T$ after
$T$ rounds, worst-case regret with respect to any point outside $[-\epsilon_T, \epsilon_T]$ grows linearly with $T$.

Theorem 9. Consider a Follow-The-Regularized-Leader algorithm that sets

\[ x_t = \arg\min_{x} \left(g_{1:t-1} x + \psi_T(x)\right) \]

where $\psi_T$ is a convex, non-negative function with $\psi_T(0) = 0$. Let $\epsilon_T$ be the maximum origin-regret
incurred by the algorithm on a sequence of $T$ gradients. Then, for any $x$ with $|x| > \epsilon_T$, there exists a
sequence of $T$ gradients such that the algorithm's regret with respect to $x$ is at least $\frac{T-1}{2}(|x| - \epsilon_T)$.

In fact, it is clear from the proof that the above result holds for any algorithm that selects $x_{t+1}$ purely
as a function of $g_{1:t}$ (in particular, with no dependence on $t$).



6 Future Work 



This work leaves open many interesting questions. It should be possible to apply our techniques
to problems that do have constrained feasible sets; for example, it is natural to consider the unconstrained
experts problem on the positive orthant. While we believe this extension is straightforward,
handling arbitrary non-axis-aligned constraints will be more difficult. Another possibility is to develop
an algorithm with bounds in terms of $H$ rather than $T$ that doesn't use a guess and double
approach.
A Proofs



This appendix gives the proofs omitted in the body of the paper, with the corresponding lemmas and 
theorems restated for convenience. 

Theorem 1. Consider an algorithm for one-dimensional online linear optimization that, when run
on a sequence of gradients $g_1, g_2, \ldots, g_T$, with $g_t \in [-1, 1]$ for all $t$, guarantees

\[ \text{Reward} \ge \kappa \exp(\gamma |g_{1:T}|) - \epsilon, \tag{1} \]

where $\gamma, \kappa > 0$ and $\epsilon \ge 0$ are constants. Then, against any comparator $x \in [-R, R]$, we have

\[ \text{Regret}(x) \le \frac{R}{\gamma}\left(\log\left(\frac{R}{\kappa\gamma}\right) - 1\right) + \epsilon, \tag{2} \]

letting $0 \log 0 = 0$ when $R = 0$. Further, any algorithm with the regret guarantee of Eq. (2) must
guarantee the reward of Eq. (1).

Proof. Let $G_T = |g_{1:T}|$. By definition, given the reward guarantee of Eq. (1) we have

\[ \text{Regret} \le RG_T - \kappa \exp(\gamma G_T) + \epsilon. \tag{11} \]

If $R = 0$, then Eq. (2) follows immediately. Otherwise, note this is a concave function in $G_T$, and
setting the first derivative equal to zero shows

\[ G^* = \frac{1}{\gamma}\log\left(\frac{R}{\gamma\kappa}\right) \]

maximizes regret (for large enough $R$ we could have $G^* > T$, and so this $G^*$ is not actually
achievable by the adversary, but this is fine for upper bounding regret). Plugging $G^*$ into Eq. (11)
and simplifying yields the bound of Eq. (2). For the second claim, suppose Eq. (2) holds. Then,
again by definition, we must have

\[ \text{Reward} \ge RG - \frac{R}{\gamma}\log\left(\frac{R}{\gamma\kappa}\right) + \frac{R}{\gamma} - \epsilon. \tag{12} \]

This bound is a concave function of $R$, and since it holds for any $R \ge 0$ by assumption, we can
choose the $R$ that maximizes the bound, namely $R^* = \gamma\kappa \exp(\gamma G)$. Note

\[ \frac{R^*}{\gamma}\log\left(\frac{R^*}{\gamma\kappa}\right) = \frac{R^*}{\gamma}\log\left(\exp(\gamma G)\right) = R^* G, \]

and so plugging $R^*$ into Eq. (12) yields

\[ \text{Reward} \ge \frac{1}{\gamma}R^* - \epsilon = \kappa \exp(\gamma G) - \epsilon. \quad \square \]



Lemma 2. Consider unconstrained gradient descent in one dimension, with learning rate $\eta$. On
round $t$, this algorithm plays the point $x_t = \eta g_{1:t-1}$. Letting $G = |g_{1:T}|$ and $H = \sum_{t=1}^{T} g_t^2$, the
cumulative reward of the algorithm is exactly

\[ \text{Reward} = \frac{\eta}{2}\left(G^2 - H\right). \]

Proof. The algorithm's cumulative reward after $T$ rounds is

\[ \sum_{t=1}^{T} x_t g_t = \sum_{t=1}^{T} g_t \eta g_{1:t-1} = \frac{\eta}{2}\left((g_{1:T})^2 - \sum_{t=1}^{T} g_t^2\right). \tag{13} \]

To verify the second equality, note that $(g_{1:T})^2 - (g_{1:T-1})^2 = g_T^2 + 2g_T(g_{1:T-1})$, so on round $T$
the right hand side increases by $\eta g_T(g_{1:T-1})$, as does the left hand side. The equality then follows
by induction on $T$. □
It is worth noting that the standard $R\sqrt{T}$ bound can be derived from the above result fairly easily.
We have

\begin{align*}
\text{Regret} &\le RG - \frac{\eta}{2}\left(G^2 - H\right) \\
&\le \frac{\eta}{2}H + \max_{G}\left(RG - \frac{\eta}{2}G^2\right) \\
&\le \frac{\eta}{2}H + \frac{R^2}{2\eta},
\end{align*}

where the max is achieved by taking $G = R/\eta$. Taking $\eta = R/\sqrt{T}$ then gives the standard bound.
However, this bound significantly underestimates the performance of constant-learning-rate gradient
descent when $G$ is large. This is in contrast to our regret bounds, which are always tight with respect
to their matching reward bounds.
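The derivation of the standard bound above can be verified by brute force on short sequences. The sketch below uses hypothetical helper names, enumerates every $\pm 1$ gradient sequence of a small length, and compares the regret of constant-learning-rate gradient descent against the best comparator in $[-R, R]$ with $\frac{\eta}{2}H + \frac{R^2}{2\eta}$.

```python
import itertools
import math
from typing import Sequence


def gd_regret_bound(R: float, eta: float, H: float) -> float:
    """(eta / 2) * H + R^2 / (2 * eta), the last line of the derivation."""
    return 0.5 * eta * H + R * R / (2.0 * eta)


def gd_regret(gradients: Sequence[float], eta: float, R: float) -> float:
    """Regret of 1-d gradient descent against the best comparator in [-R, R]."""
    x = 0.0
    total = 0.0
    for g in gradients:
        total += x * g
        x += eta * g
    return R * abs(sum(gradients)) - total


def max_regret_over_sign_sequences(T: int, eta: float, R: float) -> float:
    """Worst regret over all 2^T sequences with g_t in {-1, +1}."""
    return max(gd_regret(seq, eta, R)
               for seq in itertools.product((-1.0, 1.0), repeat=T))
```

With $\eta = R/\sqrt{T}$ and $H = T$ the bound evaluates to exactly $R\sqrt{T}$, and no enumerated sequence exceeds it.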

Theorem 4. Consider algorithm Reward-Doubling-1D-Guess, which behaves as follows. On
each era $i$, the algorithm runs Reward-Doubling-1D with an upper bound of $\bar{H}_i = 2^{i-1}$, and
initial learning rate $\eta_1^i = \epsilon 2^{-2i}$. An era ends when $\bar{H}_i$ is no longer an upper bound on the sum of
squared gradients seen during that era. Letting $c = \frac{\sqrt{2}}{\sqrt{2}-1} b$, this algorithm has regret at most

\[ \text{Regret} \le cR\sqrt{H + 1}\left(\log\left(\frac{R}{\epsilon}(2H + 2)^{5/2}\right) - 1\right) + \epsilon. \]



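Proof. Suppose round $T$ occurs in era $k$, and let $t_i$ be the round on which era $i$ starts, with $t_{k+1} =
T + 1$. Define $H_i = \sum_{s=t_i}^{t_{i+1}-1} g_s^2$. To prove the theorem we will need several inequalities. First, note
that $H = \sum_{i=1}^{k} H_i \ge \sum_{i=1}^{k-1} \bar{H}_i = 2^{k-1} - 1$, or $2^{k-1} \le H + 1$. Thus,

\[ \sum_{i=1}^{k} \sqrt{\bar{H}_i} = \frac{\sqrt{2^k} - 1}{\sqrt{2} - 1} \le \frac{\sqrt{2^k}}{\sqrt{2} - 1} \le \frac{\sqrt{2}}{\sqrt{2} - 1}\sqrt{H + 1}. \]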

Next, note that for any i we have 

\[H,. 



l 2 i_i +24 ^ l 2 2. 5 fc < I( 2 (iI + l))( 5 /2). 



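Note that the bound of Lemma 3 applies for all $T$ where $H \le \bar{H}$, and thus so does Eq. (6). Thus,
we can apply this bound to the regret in era $k$ on rounds $t_k$ through $T$, as well as on the regret in
each earlier era. Then, total regret with respect to the best point in $[-R, R]$ is at most the sum of the
regret in each era, so

\begin{align*}
\text{Regret} &\le \sum_{i=1}^{k} \left[ bR\sqrt{\bar{H}_i}\left(\log\left(\frac{4Rb}{\eta_1^i\sqrt{\bar{H}_i}}\right) - 1\right) + \eta_1^i H_i \right] \\
&\le \sum_{i=1}^{k} \left[ bR\sqrt{\bar{H}_i}\left(\log\left(\frac{R}{\epsilon}(2H + 2)^{5/2}\right) - 1\right) + \eta_1^i H_i \right] \\
&\le cR\sqrt{H + 1}\left(\log\left(\frac{R}{\epsilon}(2H + 2)^{5/2}\right) - 1\right) + \sum_{i=1}^{k} \eta_1^i H_i.
\end{align*}

Finally, because $H_i \le \bar{H}_i + 1 \le 2\bar{H}_i = 2^i$, we have $\sum_{i=1}^{k} \eta_1^i H_i \le \sum_{i=1}^{k} \epsilon 2^{-i} \le \epsilon$, which
completes the proof. □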

Theorem 6. Fix a sequence of reward functions $f_t(x) = g_t x$ with $g_t \in [-1, 1]$, and let $G_t = |g_{1:t}|$.
We consider Smooth-Reward-Doubling, which plays 0 on round 1 and whenever $G_t = 0$;
otherwise, it plays

\[ x_{t+1} = \eta\, \mathrm{sign}(g_{1:t})\, B(G_t, t + 5) \tag{9} \]

with $\eta > 0$ a learning-rate parameter and

\[ B(G, t) = \frac{1}{t^{3/2}} \exp\left(\frac{G}{\sqrt{t}}\right). \tag{10} \]

Then, at the end of each round $t$, this algorithm has

\[ \text{Reward}(t) \ge \eta \frac{1}{t + 5} \exp\left(\frac{G_t}{\sqrt{t + 5}}\right) - 1.76\eta. \]

Proof. We present a proof for the case where $\eta = 1$; since $\eta$ simply scales all of the $x_t$ played by the
algorithm (and hence, reward), the result for general $\eta$ follows immediately. We use the minimum
reward function

\[ N(G, t) = \frac{1}{t} \exp\left(\frac{G}{\sqrt{t}}\right). \tag{14} \]

The proof will be by induction on $t$, with the induction hypothesis that the cumulative reward of the
algorithm at the end of round $t$ satisfies

\[ \text{Reward}(t) \ge N(G_t, t + 5) - \epsilon_{1:t}, \tag{15} \]

where $\epsilon_1 = N(1, 6)$ and for $t \ge 1$, $\epsilon_{t+1} = \epsilon(t + 5)$ with

?(T) = 7TT cxp (7^t)"^ + i- 

We will then show that the sum of et's is always bounded by a constant. 

For the base case, $t = 1$, we play $x_1 = 0$ so end the round with zero reward, while the RHS of Eq. (15) is $N(|g_1|, 6) - N(1, 6) \le 0$.

Now, suppose the induction hypothesis holds at the end of some round $t \ge 1$. Without loss of generality, suppose $g_{1:t} \ge 0$ so $G_t = g_{1:t}$. We consider two cases. First, suppose $G_t > 0$ and $G_t + g_{t+1} > 0$ (so $g_{t+1} > -G_t$). In this case, $g_{1:t}$ does not change sign when we add $g_{t+1}$; thus, an invariant like that of Eq. (8) is sufficient; we prove such a result in Lemma 10 (given below). More precisely, we play $x_{t+1}$ according to Eq. (9), and

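$$\begin{aligned}
\text{Reward}(t + 1) &\ge N(G_t, t + 5) - \epsilon_{1:t} + g_{t+1} x_{t+1} && \text{IH and update rule} \\
&\ge N(G_t + g_{t+1}, t + 5 + 1) - \epsilon_{1:t} && \text{Lemma 10 with } \tau = t + 5 \\
&\ge N(G_{t+1}, t + 5 + 1) - \epsilon_{1:t+1} && \text{since } \epsilon_{t+1} \ge 0.
\end{aligned}$$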

For the remaining case, we have $G_t + g_{t+1} \le 0$, implying $g_{t+1} \le -G_t \le 0$. In this case, we suffer some loss and arrive at $G_{t+1} = |G_t + g_{t+1}| = -g_{t+1} - G_t$. Lemma 11 (below) provides the key bound on the additional loss when the sign of $g_{1:t}$ changes. If $G_t > 0$, we have

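$$\begin{aligned}
\text{Reward}(t + 1) &\ge N(G_t, t + 5) - \epsilon_{1:t} + g_{t+1} x_{t+1} && \text{IH and update rule} \\
&\ge N(-g_{t+1} - G_t, t + 5 + 1) - \epsilon_{1:t+1} && \text{Lemma 11 with } \tau = t + 5 \\
&= N(G_{t+1}, t + 5 + 1) - \epsilon_{1:t+1}.
\end{aligned}$$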

If $G_t = 0$, we can take $g_{t+1}$ non-positive without loss of generality, and playing $x_{t+1} = 0$ is no worse than playing $B(0, t + 5)$, and so we conclude Eq. (15) holds for all $t$. Finally,



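$$\sum_{t=2}^{\infty} \epsilon_t \le \int_{6}^{\infty} \epsilon(\tau)\, d\tau = \frac{2}{\sqrt{6}} - 2\gamma + 2\,\mathrm{Ei}\left(\frac{1}{\sqrt{7}}\right) + \log(6) \le 1.50,$$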



where $\gamma$ is the Euler gamma constant and $\mathrm{Ei}$ is the exponential integral. The upper bound can be found easily using numerical methods. Adding $\epsilon_1 = \exp(1/\sqrt{6})/6 \le 0.26$ gives $\epsilon_{1:T} \le 1.76$ for any $T$. □
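As an informal sanity check of the theorem just proved, the following Python sketch simulates Smooth-Reward-Doubling and compares the cumulative reward after every round with the bound $\eta N(G_t, t + 5) - 1.76\eta$. It assumes $B(G, \tau) = \tau^{-3/2} \exp(G/\sqrt{\tau})$, the form that appears in the proof of Lemma 10; the function names and the test sequences are illustrative choices, not part of the original analysis, and passing on a few sequences is of course not a proof.

```python
import math
import random


def N(G, tau):
    """Minimum reward function N(G, tau) = exp(G / sqrt(tau)) / tau (Eq. 14)."""
    return math.exp(G / math.sqrt(tau)) / tau


def B(G, tau):
    """Assumed bet size B(G, tau) = exp(G / sqrt(tau)) / tau**1.5."""
    return math.exp(G / math.sqrt(tau)) / tau ** 1.5


def smooth_reward_doubling(gradients, eta=1.0):
    """Return a list of (t, G_t, Reward(t)) for rounds t = 1, 2, ..."""
    running_sum = 0.0  # g_{1:t}
    reward = 0.0
    x = 0.0  # the algorithm plays 0 on round 1
    history = []
    for t, g in enumerate(gradients, start=1):
        reward += g * x
        running_sum += g
        G = abs(running_sum)
        history.append((t, G, reward))
        # Choose x_{t+1}: play 0 when G_t = 0, otherwise bet along sign(g_{1:t}).
        if G == 0:
            x = 0.0
        else:
            x = eta * math.copysign(1.0, running_sum) * B(G, t + 5)
    return history


def bound_holds(gradients, eta=1.0):
    """Check Reward(t) >= eta * N(G_t, t + 5) - 1.76 * eta after every round."""
    return all(
        reward >= eta * N(G, t + 5) - 1.76 * eta
        for t, G, reward in smooth_reward_doubling(gradients, eta)
    )


if __name__ == "__main__":
    random.seed(0)
    coin_flips = [random.choice([-1.0, 1.0]) for _ in range(2000)]
    print(bound_holds([1.0] * 200), bound_holds(coin_flips, eta=0.5))
```

The constant sequence exercises only the case handled by Lemma 10, while the alternating and random sequences repeatedly trigger the sign-change case handled by Lemma 11.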

Lemma 10. Let $G \ge 0$ and $\tau \ge 6$. Then, for any $g \in [-1, 1]$ such that $G + g \ge 0$,

$$N(G, \tau) + g\, B(G, \tau) - N(G + g, \tau + 1) \ge 0,$$

where $N$ is defined by Eq. (14) and $B$ is defined by Eq. (10).



Proof. We need to show

$$\frac{1}{\tau} \exp\left(\frac{G}{\sqrt{\tau}}\right) + \frac{g}{\tau^{3/2}} \exp\left(\frac{G}{\sqrt{\tau}}\right) - \frac{1}{\tau + 1} \exp\left(\frac{G + g}{\sqrt{\tau + 1}}\right) \ge 0,$$

or equivalently, multiplying by $\tau^{3/2}(1 + \tau)/\exp(G/\sqrt{\tau}) > 0$,

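$$\Delta = \sqrt{\tau}(1 + \tau) + g(1 + \tau) - \tau^{3/2} \exp\left(\frac{G + g}{\sqrt{\tau + 1}} - \frac{G}{\sqrt{\tau}}\right) \ge 0.$$

Since $\tau + 1 > \tau$, the exp term is maximized when $G = 0$, so

$$\Delta \ge (g + \sqrt{\tau})(1 + \tau) - \tau^{3/2} \exp\left(\frac{g}{\sqrt{\tau + 1}}\right). \qquad (16)$$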

Now, we consider the cases where $g \ge 0$ and $g < 0$ separately. First, suppose $g \ge 0$, so $g/\sqrt{\tau + 1} \in [0, 1]$, and we can use the inequality $\exp(x) \le 1 + x + x^2$ for $x \in [0, 1]$, which gives

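$$\begin{aligned}
\Delta &\ge g + g\tau + \sqrt{\tau} + \tau^{3/2} - \tau^{3/2}\left(1 + \frac{g}{\sqrt{\tau + 1}} + \frac{g^2}{\tau + 1}\right) \\
&\ge g + g\tau + \sqrt{\tau} + \tau^{3/2} - \tau^{3/2}\left(1 + \frac{g}{\sqrt{\tau}} + \frac{1}{\tau}\right) \\
&= g + g\tau + \sqrt{\tau} + \tau^{3/2} - \tau^{3/2} - g\tau - \sqrt{\tau} \\
&= g \ge 0.
\end{aligned}$$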

Now, we consider the case where $g < 0$. In order to show $\Delta \ge 0$ in this case, we need a tight upper bound on $\exp(y)$ for $y \in [-1, 0]$. To derive one, we note that for $x \ge 0$, $\exp(x) \ge 1 + x + \frac{1}{2}x^2$ from the series representation of $e^x$, and so $\exp(-x) \le (1 + x + \frac{1}{2}x^2)^{-1}$. Thus, for $y \in [-1, 0]$ we have $\exp(y) \le (1 - y + \frac{1}{2}y^2)^{-1} \equiv Q(y)$. Then, starting from Eq. (16),



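$$\Delta \ge (g + \sqrt{\tau})(1 + \tau) - \tau^{3/2}\, Q\left(\frac{g}{\sqrt{\tau + 1}}\right).$$

Let $\Delta_2 = \Delta\, Q\left(\frac{g}{\sqrt{\tau + 1}}\right)^{-1}$. Because $\Delta_2$ and $\Delta$ have the same sign, it suffices to show $\Delta_2 \ge 0$. We have

$$\begin{aligned}
\Delta_2 &= \left(1 - \frac{g}{\sqrt{\tau + 1}} + \frac{g^2}{2(\tau + 1)}\right)(g + \sqrt{\tau})(1 + \tau) - \tau^{3/2} \\
&= \left(1 + \tau - g\sqrt{\tau + 1} + \tfrac{1}{2}g^2\right)(g + \sqrt{\tau}) - \tau^{3/2}.
\end{aligned}$$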

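First, note

$$\frac{d}{dg}\Delta_2 = 1 + \tfrac{3}{2}g^2 + g\sqrt{\tau} + \tau - 2g\sqrt{1 + \tau} - \sqrt{\tau}\sqrt{1 + \tau}.$$

Since $g < 0$, we have $-2g\sqrt{\tau + 1} + g\sqrt{\tau} > 0$, and $1 + \tau - \sqrt{\tau}\sqrt{\tau + 1} > 0$, and so we conclude that $\Delta_2$ is increasing in $g$, and so taking $g = -1$ we have

$$\Delta_2 \ge \left(\tfrac{3}{2} + \tau + \sqrt{\tau + 1}\right)(-1 + \sqrt{\tau}) - \tau^{3/2}.$$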

Taking the derivative with respect to $\tau$ reveals this expression is increasing in $\tau$, and taking $\tau = 6$ produces a positive value, proving this case. □
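The inequality of Lemma 10 can also be spot-checked numerically. The sketch below evaluates the left-hand side $N(G, \tau) + gB(G, \tau) - N(G + g, \tau + 1)$ on a finite grid and reports the smallest value found. It assumes $B(G, \tau) = \tau^{-3/2}\exp(G/\sqrt{\tau})$, as used in the proof above; the grid and the helper names are arbitrary illustrative choices, and a grid check is no substitute for the proof.

```python
import math


def N(G, tau):
    """N(G, tau) = exp(G / sqrt(tau)) / tau (Eq. 14)."""
    return math.exp(G / math.sqrt(tau)) / tau


def B(G, tau):
    """Assumed form B(G, tau) = exp(G / sqrt(tau)) / tau**1.5."""
    return math.exp(G / math.sqrt(tau)) / tau ** 1.5


def lemma10_gap(G, g, tau):
    """Left-hand side of Lemma 10; the lemma claims it is non-negative."""
    return N(G, tau) + g * B(G, tau) - N(G + g, tau + 1)


def min_gap_on_grid(taus, G_values, g_values):
    """Smallest gap over all grid points that satisfy G + g >= 0."""
    return min(
        lemma10_gap(G, g, tau)
        for tau in taus
        for G in G_values
        for g in g_values
        if G + g >= 0
    )


if __name__ == "__main__":
    taus = range(6, 101)
    G_values = [0.5 * i for i in range(21)]        # G in [0, 10]
    g_values = [j / 20 for j in range(-20, 21)]    # g in [-1, 1]
    print(min_gap_on_grid(taus, G_values, g_values))
```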

Lemma 11. For any $g \in [-1, 0]$ and $G \ge 0$ such that $G + g \le 0$, and any $\tau \ge 1$,

$$N(G, \tau) + g\, B(G, \tau) \ge N(-g - G, \tau + 1) - \epsilon(\tau),$$

where $N$ is defined by Eq. (14) and $B$ is defined by Eq. (10), and

$$\epsilon(\tau) = \frac{1}{\tau + 1} \exp\left(\frac{1}{\sqrt{\tau + 1}}\right) - \frac{1}{\tau} + \frac{1}{\tau^{3/2}}.$$



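Proof. We have

$$\begin{aligned}
&N(-g - G, \tau + 1) - N(G, \tau) - g\, B(G, \tau) \\
&\quad = \frac{1}{\tau + 1} \exp\left(\frac{-g - G}{\sqrt{\tau + 1}}\right) - \frac{1}{\tau} \exp\left(\frac{G}{\sqrt{\tau}}\right) - \frac{g}{\tau^{3/2}} \exp\left(\frac{G}{\sqrt{\tau}}\right),
\end{aligned}$$

and since this expression is increasing as $g$ decreases, and $g \ge -1$ in any case,

$$\le \frac{1}{\tau + 1} \exp\left(\frac{1 - G}{\sqrt{\tau + 1}}\right) - \frac{1}{\tau} \exp\left(\frac{G}{\sqrt{\tau}}\right) + \frac{1}{\tau^{3/2}} \exp\left(\frac{G}{\sqrt{\tau}}\right),$$

and since $\tau^{3/2} \ge \tau$, taken together the second two terms increase as $G$ decreases, as does the first term, so since $G \ge 0$,

$$\le \frac{1}{\tau + 1} \exp\left(\frac{1}{\sqrt{\tau + 1}}\right) - \frac{1}{\tau} + \frac{1}{\tau^{3/2}} = \epsilon(\tau),$$

and re-arranging proves the lemma. □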

Theorem 9. Consider a Follow-The-Regularized-Leader algorithm that sets

$$x_t = \arg\min_{x} \left( g_{1:t-1}\, x + \psi_T(x) \right)$$

where $\psi_T$ is a convex, non-negative function with $\psi_T(0) = 0$. Let $\epsilon_T$ be the maximum origin-regret incurred by the algorithm on a sequence of $T$ gradients. Then, for any $x$ with $|x| > \epsilon_T$, there exists a sequence of $T$ gradients such that the algorithm's regret with respect to $x$ is at least $\frac{T - 1}{2}(|x| - \epsilon_T)$.

Proof. For simplicity, we will prove that regret is at least $\frac{T}{2}(|x| - \epsilon_T)$ when $T$ is even; if $T$ is odd, we simply take $g_T = 0$ and consider the first $T - 1$ rounds.

Let $T = 2M$. We will consider two gradient sequences. First, suppose $g_t = 1$ for $t \le M$, and $g_t = -1$ otherwise. Observe that for any $r$, we have $g_{1:M-r} = g_{1:M+r}$, which implies $x_{M-r+1} = x_{M+r+1}$. Thus, the algorithm's total reward is

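$$\begin{aligned}
\sum_{t=1}^{T} g_t x_t &= \sum_{t=1}^{M} x_t - \sum_{t=M+1}^{T} x_t \\
&= x_1 - x_{M+1} + \sum_{r=1}^{M-1} \left( x_{M-r+1} - x_{M+r+1} \right) \\
&= x_1 - x_{M+1}.
\end{aligned}$$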



Because $x_1 = 0$, we get that on this sequence the algorithm has origin-regret $\hat{x} = x_{M+1}$, and so by assumption $\hat{x} \le \epsilon_T$.

Next, suppose $g_t = 1$ for $t \le M$, and $g_t = 0$ otherwise. For this sequence, we will have $x_t \le \hat{x} \le \epsilon_T$ for all $t$, so total reward is at most $M\epsilon_T$. For any positive $x$ with $x > \epsilon_T$, this means that regret with respect to $x$ is at least

$$xM - M\epsilon_T = M(|x| - \epsilon_T).$$

For $x < -\epsilon_T$, we can use a similar argument with the sign of the gradients reversed (for both gradient sequences) to get the same bound. □

In proving Theorem 7 we will use the following lemma.

Lemma 12. Let $G_T = \sum_{i=1}^{T} g_i$ be the sum of $T$ random variables, each drawn uniformly from $\{-1, 1\}$. Then, for any integer $k$ that is a factor of $T$, we have

$$\Pr\left[G_T \ge \sqrt{kT}\right] \ge p^k,$$

where $p = \frac{7}{64} = 0.109375$.

Proof. First, for any $T$ define $p_T = \Pr[G_T \ge \sqrt{T}]$, and define

$$p = \inf_{T \in \mathbb{Z}^+} p_T.$$

For any $T$, we have $p_T \ge 2^{-T}$ trivially, and by the Central Limit Theorem, $\lim_{T \to \infty} p_T = 1 - \mathcal{N}_{0,1}(1) > 0$, where $\mathcal{N}_{0,1}$ is the standard normal cumulative distribution function. It follows that $p > 0$, and using numerical methods we find $p = p_6 = \frac{7}{64} = 0.109375$.

Now, divide the length-$T$ sequence into $k$ sequences of length $T/k$. Let $Z_i$ be the sum of gradients for the $i$th of these sequences. Observe that if $Z_i \ge \sqrt{T/k}$ for all $i$, then $G_T = \sum_{i=1}^{k} Z_i \ge k\sqrt{T/k} = \sqrt{kT}$. Furthermore, for any $i$, we have

$$\Pr\left[Z_i \ge \sqrt{T/k}\right] = \Pr\left[G_{T/k} \ge \sqrt{T/k}\right].$$

Thus,

$$\Pr\left[G_T \ge \sqrt{kT}\right] \ge \prod_{i=1}^{k} \Pr\left[Z_i \ge \sqrt{T/k}\right] = \left(p_{T/k}\right)^k \ge p^k.$$

□
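The constant $p$ in Lemma 12 is stated as the result of a numerical search. A minimal sketch of such a search, using exact rational arithmetic over binomial counts, is given below. The function names and the search range are illustrative assumptions; a finite search only confirms the minimum over the range examined, with the Central Limit Theorem argument above covering large $T$.

```python
from fractions import Fraction
from math import comb


def tail_prob(T, threshold_sq):
    """Exact Pr[G_T >= sqrt(threshold_sq)] for G_T a sum of T uniform +/-1 draws.

    With h draws equal to +1, the sum is G_T = 2h - T, so the event is
    G_T >= 0 together with G_T**2 >= threshold_sq.
    """
    favourable = sum(
        comb(T, h)
        for h in range(T + 1)
        if 2 * h - T >= 0 and (2 * h - T) ** 2 >= threshold_sq
    )
    return Fraction(favourable, 2 ** T)


def p_T(T):
    """p_T = Pr[G_T >= sqrt(T)], as defined in the proof of Lemma 12."""
    return tail_prob(T, T)


if __name__ == "__main__":
    best_T = min(range(1, 101), key=p_T)
    print(best_T, p_T(best_T), float(p_T(best_T)))
    # Spot-check the lemma itself for a few (T, k) with k a factor of T.
    for T, k in [(12, 3), (24, 4), (60, 5)]:
        print(T, k, tail_prob(T, k * T) >= Fraction(7, 64) ** k)
```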



Theorem 7. Consider the problem of unconstrained online linear optimization in one dimension, and an online algorithm that guarantees origin-regret at most $\epsilon$. Then, for any fixed comparator $x$, and any integer $T_0$, there exists a gradient sequence $\{g_t\} \in [-1, 1]^T$ of length $T \ge T_0$ for which the algorithm's regret satisfies

$$\text{Regret}(x) \ge 0.336\,|x| \sqrt{T \log\left(\frac{|x|\sqrt{T}}{\epsilon}\right)}.$$



Proof. Let $R = |x|$, let $k = k(T) = \lfloor \log(R\sqrt{T}/\epsilon) / \log(p^{-1}) \rfloor$, and choose $T \ge T_0$ large enough so that $4 \le k \le T$ and also so that $T$ is a multiple of $k$ (the latter is possible since $k(T)$ grows much more slowly than $T$). Let $Q$ be the algorithm's reward when each $g_t$ is drawn uniformly from $\{-1, 1\}$. Let $G = g_{1:T}$. As shown in the proof sketch, we have

$$E\left[Q \mid G \ge \sqrt{kT}\right] \le \frac{\epsilon}{\Pr\left[G \ge \sqrt{kT}\right]}.$$



By Lemma 12, $\Pr[G \ge \sqrt{kT}] \ge p^k$. Thus,

$$E\left[Q \mid G \ge \sqrt{kT}\right] \le \epsilon p^{-k} = \epsilon \exp(k \log p^{-1}) \le R\sqrt{T}.$$



If the algorithm guaranteed $Q > R\sqrt{T}$ whenever $G \ge \sqrt{kT}$, then we would have $E[Q \mid G \ge \sqrt{kT}] > R\sqrt{T}$, a contradiction. Thus, there exists a sequence where $G \ge \sqrt{kT}$ and $Q \le R\sqrt{T}$, so on this sequence we have

$$\text{Regret} \ge R\sqrt{kT} - R\sqrt{T} = R\sqrt{T}\left(\sqrt{k} - 1\right).$$

Because $k \ge 4$, we have $\frac{1}{2}\sqrt{k} \ge 1$ or $\sqrt{k} - 1 \ge \frac{1}{2}\sqrt{k}$, so regret is at least $\frac{1}{2}R\sqrt{kT} = bR\sqrt{T \log\left(R\sqrt{T}/\epsilon\right)}$, where $b = \frac{1}{2}\sqrt{1/\log p^{-1}} > 0.336$ (and $p$ is the constant from Lemma 12). □
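The constants in this argument are easy to evaluate directly. The short sketch below computes $b = \frac{1}{2}\sqrt{1/\log p^{-1}}$ for $p = 7/64$ and the era count $k(T)$ for given $R$, $T$ and $\epsilon$, and checks the step $\epsilon p^{-k} \le R\sqrt{T}$. The function names and the example values of $R$, $T$ and $\epsilon$ are hypothetical, chosen only for illustration.

```python
import math

P = 7 / 64  # the constant p from Lemma 12


def lower_bound_constant(p=P):
    """b = (1/2) * sqrt(1 / log(1/p)), the constant in the final bound."""
    return 0.5 * math.sqrt(1.0 / math.log(1.0 / p))


def k_of_T(R, T, eps, p=P):
    """k(T) = floor(log(R * sqrt(T) / eps) / log(1/p))."""
    return math.floor(math.log(R * math.sqrt(T) / eps) / math.log(1.0 / p))


if __name__ == "__main__":
    R, T, eps = 1.0, 10 ** 8, 1e-3  # illustrative values only
    k = k_of_T(R, T, eps)
    print(lower_bound_constant(), k)
    # The choice of k guarantees eps * p**(-k) <= R * sqrt(T).
    print(eps * P ** (-k) <= R * math.sqrt(T))
```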